:orphan:
Core Basics 3: Train a Classifier on a Snowflake Multi-Table Dataset
====================================================================
In this notebook, we learn how to train a classifier on more complex
multi-table data, where a secondary table is itself the parent table of
another table (i.e., a snowflake schema). We highly recommend going
through the *Basics 1* and *Basics 2* lessons first if you are not
familiar with Khiops.
Make sure you have installed Khiops and Khiops Visualization.
We start by importing Khiops, checking its installation and defining
some helper functions:

.. code:: ipython3

    import os
    import platform
    import subprocess

    from khiops import core as kh


    # Define helper functions
    def peek(file_path, n=10):
        """Shows the first n lines of a file"""
        with open(file_path, encoding="utf8", errors="replace") as file:
            for line in file.readlines()[:n]:
                print(line, end="")
        print("")


    # If there are any issues, you may print the Khiops status with the following command
    # kh.get_runner().print_status()

Training a Multi-Table Classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We’ll train a multi-table classifier on an extension of the
``AccidentsSummary`` dataset that we used in the previous notebook
*Core Basics 2*. This dataset, ``Accidents``, contains two additional
tables, ``Place`` and ``User``, and is organized in the following
relational snowflake schema:

::

    Accident
    |
    | -- 1:n -- Vehicle
    |             |
    |             |-- 1:n -- User
    |
    | -- 1:1 -- Place

Note that the target variable is ``Gravity``.
To train a classifier for this setup, this schema must be encoded in
the dictionary file. Let’s check the contents of the ``Accidents``
dictionary file:
.. code:: ipython3
accidents_dataset_dir = os.path.join(kh.get_samples_dir(), "Accidents")
accidents_kdic = os.path.join(accidents_dataset_dir, "Accidents.kdic")
print(f"Accidents dictionary file location: {accidents_kdic}")
print("")
peek(accidents_kdic, n=45)
.. parsed-literal::
Accidents dictionary file location: /github/home/khiops_data/samples/Accidents/Accidents.kdic
Root Dictionary Accident(AccidentId)
{
Categorical AccidentId;
Categorical Gravity;
Date Date;
Time Hour;
Categorical Light;
Categorical Department;
Categorical Commune;
Categorical InAgglomeration;
Categorical IntersectionType;
Categorical Weather;
Categorical CollisionType;
Categorical PostalAddress;
Categorical GPSCode;
Numerical Latitude;
Numerical Longitude;
Entity(Place) Place;
Table(Vehicle) Vehicles;
};
Dictionary Place(AccidentId)
{
Categorical AccidentId;
Categorical RoadType;
Categorical RoadNumber;
Categorical RoadSecNumber;
Categorical RoadLetter;
Categorical Circulation;
Numerical LaneNumber;
Categorical SpecialLane;
Categorical Slope;
Categorical RoadMarkerId;
Numerical RoadMarkerDistance;
Categorical Layout;
Numerical StripWidth;
Numerical LaneWidth;
Categorical SurfaceCondition;
Categorical Infrastructure;
Categorical Localization;
Categorical SchoolNear;
};
Dictionary Vehicle(AccidentId, VehicleId)

Note the following differences in comparison with the dictionary of
the ``AccidentsSummary`` dataset:

- The schema of the main table contains one extra special variable,
  defined with the statement ``Entity(Place) Place``, which indicates a
  ``1:1`` relationship between the ``Accident`` and ``Place`` tables.
- The main table ``Accident`` and the entity ``Place`` have the same
  key, ``AccidentId``. The table ``Vehicle`` and its child table
  ``User`` have a key made of two fields, ``AccidentId`` and
  ``VehicleId``.

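
The way these keys link the tables can be checked with a few lines of
plain Python. The sketch below is only an illustration under stated
assumptions: the ``schema_keys`` and ``parent_of`` mappings are
transcribed by hand from the dictionary file and the remarks above (they
are not read from the ``.kdic`` file), and ``key_extends_parent`` is a
hypothetical helper, not part of the Khiops API. It checks that the key
of each child table starts with the key fields of its parent.

.. code:: ipython3

    # Keys transcribed by hand from the dictionary file and the remarks above
    schema_keys = {
        "Accident": ["AccidentId"],
        "Place": ["AccidentId"],
        "Vehicle": ["AccidentId", "VehicleId"],
        "User": ["AccidentId", "VehicleId"],
    }

    # Parent of each secondary table in the snowflake schema
    parent_of = {"Place": "Accident", "Vehicle": "Accident", "User": "Vehicle"}


    def key_extends_parent(child, parent):
        """Returns True if the child's key starts with the parent's key fields"""
        parent_key = schema_keys[parent]
        return schema_keys[child][: len(parent_key)] == parent_key


    for child, parent in parent_of.items():
        print(f"{child} -> {parent}: {key_extends_parent(child, parent)}")

Each of the three parent-child links of the schema satisfies the check.
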
Now let’s store the location of the tables and peek at their contents:
.. code:: ipython3
accidents_data_file = os.path.join(accidents_dataset_dir, "Accidents.txt")
print(f"Accidents data table: {accidents_data_file}")
print("")
peek(accidents_data_file)
vehicles_data_file = os.path.join(accidents_dataset_dir, "Vehicles.txt")
print(f"Vehicles data table: {vehicles_data_file}")
print("")
peek(vehicles_data_file)
places_data_file = os.path.join(accidents_dataset_dir, "Places.txt")
print(f"Places data table: {places_data_file}")
print("")
peek(places_data_file)
users_data_file = os.path.join(accidents_dataset_dir, "Users.txt")
print(f"Users data table: {users_data_file}")
print("")
peek(users_data_file)
.. parsed-literal::
Accidents data table: /github/home/khiops_data/samples/Accidents/Accidents.txt
AccidentId Gravity Date Hour Light Department Commune InAgglomeration IntersectionType Weather CollisionType PostalAddress GPSCode Latitude Longitude
201800000001 NonLethal 2018-01-24 15:05:00 Daylight 590 005 No Y-type Normal 2Vehicles-BehindVehicles-Frontal route des Ansereuilles M 50.55737 2.55737
201800000002 NonLethal 2018-02-12 10:15:00 Daylight 590 011 Yes Square VeryGood NoCollision Place du général de Gaul M 50.52936 2.52936
201800000003 NonLethal 2018-03-04 11:35:00 Daylight 590 477 Yes T-type Normal NoCollision Rue nationale M 50.51243 2.51243
201800000004 NonLethal 2018-05-05 17:35:00 Daylight 590 052 Yes NoIntersection VeryGood 2Vehicles-Side 30 rue Jules Guesde M 50.51974 2.51974
201800000005 NonLethal 2018-06-26 16:05:00 Daylight 590 477 Yes NoIntersection Normal 2Vehicles-Side 72 rue Victor Hugo M 50.51607 2.51607
201800000006 NonLethal 2018-09-23 06:30:00 TwilightOrDawn 590 052 Yes NoIntersection LightRain Other D39 M 50.52132 2.52132
201800000007 NonLethal 2018-09-26 00:40:00 NightStreelightsOn 590 133 Yes NoIntersection Normal Other 4 route de camphin M 50.52211 2.52211
201800000008 Lethal 2018-11-30 17:15:00 NightStreelightsOn 590 011 Yes NoIntersection Normal Other rue saint exupéry M 50.53146 2.53146
201800000009 NonLethal 2018-02-18 15:57:00 Daylight 590 550 No NoIntersection Normal Other rue de l'égalité M 50.53707 2.53707
Vehicles data table: /github/home/khiops_data/samples/Accidents/Vehicles.txt
AccidentId VehicleId Direction Category PassengerNumber FixedObstacle MobileObstacle ImpactPoint Maneuver
201800000001 A01 Unknown Car<=3.5T 0 None Vehicle RightFront TurnToLeft
201800000001 B01 Unknown Car<=3.5T 0 None Vehicle LeftFront NoDirectionChange
201800000002 A01 Unknown Car<=3.5T 0 None Pedestrian None NoDirectionChange
201800000003 A01 Unknown Motorbike>125cm3 0 StationaryVehicle Vehicle Front NoDirectionChange
201800000003 B01 Unknown Car<=3.5T 0 None Vehicle LeftSide TurnToLeft
201800000003 C01 Unknown Car<=3.5T 0 None None RightSide Parked
201800000004 A01 Unknown Car<=3.5T 0 None Other RightFront Avoidance
201800000004 B01 Unknown Bicycle 0 None Vehicle LeftSide None
201800000005 A01 Unknown Moped 0 None Vehicle RightFront PassLeft
Places data table: /github/home/khiops_data/samples/Accidents/Places.txt
AccidentId RoadType RoadNumber RoadSecNumber RoadLetter Circulation LaneNumber SpecialLane Slope RoadMarkerId RoadMarkerDistance Layout StripWidth LaneWidth SurfaceCondition Infrastructure Localization SchoolNear
201800000001 Departamental 41 C TwoWay 2 0 Flat RightCurve Normal Unknown Lane 00
201800000002 Communal 41 D TwoWay 2 0 Flat LeftCurve Normal Unknown Lane 00
201800000003 Departamental 39 D TwoWay 2 0 Flat Straight Normal Unknown Lane 00
201800000004 Departamental 39 TwoWay 2 0 Flat Straight Normal Unknown Lane 00
201800000005 Communal OneWay 1 0 Flat Straight Normal Unknown Lane 00
201800000006 Departamental 39 D Unknown 2 0 Uphill LeftCurve Wet Unknown Shoulder 00
201800000007 Departamental 41 D TwoWay 2 0 Flat 16 500 Straight Normal Unknown Shoulder 00
201800000008 Communal - TwoWay 2 0 Flat Straight Normal Unknown Lane 00
201800000009 Departamental 141 D TwoWay 2 0 Flat Straight Normal Unknown Shoulder 00
Users data table: /github/home/khiops_data/samples/Accidents/Users.txt
AccidentId VehicleId Seat Category Gender TripReason SafetyDevice SafetyDeviceUsed PedestrianLocation PedestrianAction PedestrianCompany BirthYear
201800000001 A01 1 Driver Male Leisure SeatBelt Yes None None Unknown 1960
201800000001 B01 1 Driver Male None SeatBelt Yes None None Unknown 1928
201800000002 A01 1 Driver Male None SeatBelt Yes None None Unknown 1947
201800000002 A01 Pedestrian Male None Helmet OnLane<=OnSidewalk0mCrossing Crossing Alone 1959
201800000003 A01 1 Driver Male Leisure Helmet Yes None None Unknown 1987
201800000003 C01 1 Driver Male None ChildrenDevice None None Unknown 1977
201800000004 A01 1 Driver Male Leisure SeatBelt Yes None None Unknown 1982
201800000004 B01 1 Driver Male Leisure Helmet None None Unknown 2013
201800000005 A01 1 Driver Male Leisure Helmet Yes None None Unknown 2001
Train a classifier for the ``Accidents`` database with 1000 variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The call to ``train_predictor`` is exactly the same as in the exercise
of the previous notebook *Core Basics 2*. The only difference is that
the ``additional_data_tables`` dictionary, which contains the paths of
the additional tables, is extended with two new data paths:

- The path of the entity ``Place`` is ``Accident`Place``.
- The path of the table ``User`` is ``Accident`Vehicles`Users``.

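
These data paths follow a simple pattern: each one starts with the name
of the main dictionary and appends, separated by backticks, the names of
the table or entity variables traversed to reach the table. A minimal
sketch of this pattern is given below. Note that ``build_data_paths``
and the nested ``schema`` mapping are hypothetical illustrations and not
part of the Khiops API, and that the variable name ``Users`` is taken
from the path above because the ``Vehicle`` dictionary is not displayed
in full.

.. code:: ipython3

    def build_data_paths(root_name, sub_tables, separator="`"):
        """Returns the data paths of all secondary tables, depth first"""
        paths = []
        for variable_name, children in sub_tables.items():
            path = f"{root_name}{separator}{variable_name}"
            paths.append(path)
            # The children of a secondary table extend the path of that table
            paths.extend(build_data_paths(path, children, separator))
        return paths


    # Table and entity variables of the snowflake schema (hand-written)
    schema = {"Place": {}, "Vehicles": {"Users": {}}}

    for data_path in build_data_paths("Accident", schema):
        print(data_path)

The three printed paths are the keys expected in
``additional_data_tables``.
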
As before, we’ll ask Khiops to create 1000 additional features with its
multi-table AutoML mode.

Do not forget:

- The target variable is ``Gravity``
- Set ``max_trees=0``

With these considerations, let’s now train the classifier:

.. code:: ipython3

    accidents_results_dir = os.path.join("exercises", "Accidents")
    accidents_report, accidents_model_kdic = kh.train_predictor(
        accidents_kdic,
        dictionary_name="Accident",
        data_table_path=accidents_data_file,
        target_variable="Gravity",
        results_dir=accidents_results_dir,
        additional_data_tables={
            "Accident`Vehicles": vehicles_data_file,
            "Accident`Place": places_data_file,
            "Accident`Vehicles`Users": users_data_file,
        },
        max_constructed_variables=1000,
        max_trees=0,
    )
    print(f"Accidents report file: {accidents_report}")
    print(f"Accidents modeling dictionary file: {accidents_model_kdic}")

.. parsed-literal::
Accidents report file: exercises/Accidents/AllReports.khj
Accidents modeling dictionary file: exercises/Accidents/Modeling.kdic
Take a look at the report
^^^^^^^^^^^^^^^^^^^^^^^^^
Which variables best predict the gravity of an accident?
.. code:: ipython3
# To visualize uncomment the line below
# kh.visualize_report(accidents_report)
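
The question can also be explored without the visualization tool, since
the ``.khj`` report is a JSON file. The sketch below is a hedged
illustration: it assumes that the report has a ``preparationReport``
section containing a ``variablesStatistics`` list whose entries carry
``name`` and ``level`` fields. This layout is an assumption that may not
hold for every Khiops version, and ``load_report`` and ``top_variables``
are hypothetical helpers, not part of the Khiops API.

.. code:: ipython3

    import json


    def load_report(report_path):
        """Loads a JSON report file into a Python dictionary"""
        with open(report_path, encoding="utf8", errors="replace") as report_file:
            return json.load(report_file)


    def top_variables(report, n=10):
        """Returns the n (name, level) pairs with the highest level"""
        # Assumed layout: report["preparationReport"]["variablesStatistics"]
        stats = report.get("preparationReport", {}).get("variablesStatistics", [])
        ranked = sorted(stats, key=lambda var: var.get("level", 0), reverse=True)
        return [(var["name"], var.get("level", 0)) for var in ranked[:n]]


    # Usage, once the training step has produced the report:
    # for name, level in top_variables(load_report(accidents_report)):
    #     print(f"{level:.4f}  {name}")

If the assumed sections are missing, ``top_variables`` returns an empty
list instead of raising an error.
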